\input paper.tex

\TITL{COGNITIVE  ECONOMY}

\vfill

\AUTHO{Douglas B. Lenat, \ Stanford University}

\AUTHO{Frederick Hayes-Roth, \ Rand Corporation}

\AUTHO{Phillip Klahr, \ Rand Corporation}

\vfill

\vskip 10pt plus 2pt

\NNSECP{ABSTRACT}
       Intelligent systems can explore only tiny subsets of their potential
       external and conceptual worlds.   They must develop efficient  forms
       of  representation,  access,   and  operation   to  increase   their
       capacities.  Some of these  forms involve abstraction, caching,  and
       expectation-simplified processing.  These capabilities in turn  can
       combine to provide extremely powerful increases in performance.  For
       example, all three can combine to simplify simulation or one of its
       related functions, the detection of surprising events.  In developing
       economic principles for modern  AI systems, we are forced to favor
       some counterintuitive ideas, policies which are not generally
       followed because their contribution to overall cognitive utility is
       not apparent.  For example, the  nonredundant  storage of properties
       in hierarchical
       inheritance nets  increases many  processing costs  while  providing
       minimal storage cost  savings.

\vfill

\eject

\NSECP{INTRODUCTION}

When we build an AI program, we often find ourselves caught in the following
tradeoff: On the one hand, we want to develop our initial systems quickly
and painlessly, since they are experimental and ephemeral. On the other hand,
we want our systems to run as efficiently as possible, to minimize
temporal delays and to reduce the cpu time and space required.  
We are torn between
{\it expressibility} and {\it efficiency}.


Many AI  researchers {\sl  cum} language  designers have  focused on  {\it 
expressibility},  the   problem   of  rapidly   constructing   a   working
experimental vehicle.\foo{Consider the goals of {\:c SAFE} [Balzer 77], 
{\:c PSI} [Green 77], {\:c EMYCIN} [ref], {\:c RITA} [or Rosie?], and 
{\:c KRL} [Winograd & Bobrow 77].}
They have produced some excellent software such  as
Interlisp's {\:c DWIM}, File, and Break  Packages.\foo{[Bobrow & Raphael 73]  is
still an excellent  discussion of  the issues  involved in  such ``AI  language"
efforts.}  
Four fundamental  techniques
utilized in highly {\it expressive} programs are:
({\it i}) reliance upon a very high level language, 
({\it ii}) planning and reasoning at multiple levels of abstraction,
({\it iii}) inference by pattern-directed knowledge sources, and
({\it iv}) minimal, nonredundant representation, such as a canonical
generalization/specialization hierarchy.

This paper attempts to draw attention to the second goal, 
{\it efficiency}.
We present techniques which do not sacrifice expressibility, yet enable
programs to (semi-)automatically improve themselves and increase their productivity.
The basic source of power is the ability to predict the way that the program will
be used in the future, and to tailor the program to expedite such uses.

The traditional approach to program optimization has assumed that the predictions
are made by the programmer (e.g., by explicitly providing assertions) or can be
inferred statically by
({\it i}) Knuthian and Dijkstrian analysis of the program,
({\it ii}) carefully designing the program's data structures to be appropriate,
and ({\it iii}) compiling (as in the {\:c FORTRAN H} compiler; optimizing
transformations of the program [Burstall & Darlington 73]; etc.).

We advocate the use of methods more {\it dynamic} than those.
One valuable source of prediction comes from the program's experiences as it is run.
If a program can ``learn'' from its experiences, it might try applying each piece of
newly acquired knowledge to the task of specializing itself, modifying itself
to run more efficiently in the current runtime environment.
We believe that a program's ``intelligence" can be increased in this way;
that is, by increasing its ability to
acquire  appropriate knowledge, to organize that knowledge, and to refine
the conditions under which that knowledge is recognized to be applicable.

Some earlier automatic programming systems [Darlington & Burstall, Lenat's PUP6,
Jim Low, Kant & Barstow,
others?] were designed to improve programs' efficiency, and many of their
ideas have found their way into the techniques we now describe.  
Those systems 
had program construction, transformation, and optimization as their primary
task; we are proposing general methods to provide 
self-improving abilities
to other kinds of AI programs
(speech understanders, theory formers, expert simulators, paraphrasers, ...).




Three ``dynamic'' abilities which make a program {\it efficient} are:


\noindent $\bullet $ 
{\bf Dynamic self-monitoring} (the ability  to sense, record, and  analyze
dynamic usage)  {\bf  and  self-modification} (the  ability  to  use  that
knowledge   to   redesign/recompile    itself   with   more    appropriate
representations, algorithms, data structures; i.e., intelligent learning.)

\noindent $\bullet $ 
{\bf Caching of computed results}
(storing the results of frequently-requested
searches, so they needn't be repeated over and over again; i.e.,
intelligent redundancy.)

\noindent $\bullet $ 
{\bf Expectation-filtering}
(using predictions to filter away expected, unsurprising data,
thereby freeing up processing time for more difficult subtasks;
i.e., intelligent focus of attention.)

A fourth ability, which we shall also discuss in this paper, is one of the
more static techniques mentioned above, a kind of ``designing appropriate
data structures'':

\noindent $\bullet $ 
{\bf Multiple levels of abstraction}
(redundant representation of knowledge at several levels of abstraction
can be an economical way of structuring a knowledge base, especially if the
program's tasks are large and its resource allotments vary in magnitude;
i.e., this is a technique for intelligent knowledge organization.)



\yskip

{\it Cognitive economy} is the term  by which we describe such  heightened
productivity. 
Computer programs, no less than biological creatures, must perform 
in an environment: an externally imposed set of demands, pressures,
opportunities, regularities.\foo{For a similar recognition of the
importance of the ``task
environment'', see [ref to Newell & Simon's HPS].}
Extending this analogy, we find that
cognitive economy is the degree to which a program is adapted to its
environment, the extent to which
its internal capabilities (structures and processes) accurately reflect
its environmental niche.

\hbox par 2.9in{    % this makes a space for the little figure; see below.
Notice that representing a corpus of knowledge as a  minimal
(canonical)      generalization/specialization       hierarchy,       with
interpretatively-computed inheritance, is {\it  not} in this category:  it
favors expressability,  compaction of  representation, at  the expense  of
performance. It is  true that  a horse  is a mammal,  and a  mammal is  an
animal, and from those two we could compute that a horse is an animal (see
Fig. n), but it is more cognitively economical to store one redundant  arc
than to recompute it frequently.   Psychological studies [ref] indicate
just such redundancies being created and strengthened by repetitions.
}

[[in space above, insert DIAGRAM OF HORSE/MAMMAL/ANIMAL, 
WITH REDUNDANT ARC AS A DOTTED ARROW]]

[[more about what cog. econ is -- and is not -- should go here,probably.]]


Every scientific theory is  constructed in a  rich context of  surrounding
theories, methods,  and standards  which determine  which experiments  are
reasonable ones to perform and which point of view to take when  interpreting
the results  -- in short, a {\it paradigm}. We feel it useful
to articulate the ``hard core'' of our paradigm (the assumptions our  theory
rests upon) before  delving into more detail about cognitive economy.
For that reason, the entire next section is devoted to a presentation of our
assumptions, including:
a model for intelligence
(Section 2.1),  a  model for  how  intelligent computer  programs  may  be
organized (Section 2.2),  and a  model of the  characteristics of  present
computing engines (Section 2.3).
Section 3 assumes these models, and contains detailed discussions and
examples of the three main components of cognitive economy: 
dynamic self-monitoring and self-modification (Section 3.1),
caching of computed results (Section 3.2),
and expectation-filtering (Section 3.3).
One interesting result that emerges is the prediction that as task size
grows large enough, the techniques which currently are used for expressiveness
(e.g., multiple levels of abstraction) will become important for
cognitive economy -- for efficiency -- as well.

\NSECP{THE  ASSUMPTIONS}


The foundation upon  which our  theories are erected  consists of  several
kinds  of  assumptions:

(i)  We  continually  face  searches  in  more  or  less  immense  spaces;
intelligence is the ability to bring {\it appropriate} knowledge to  bear,
to speed  up  such  searching.   Increasing  intelligence  then  comprises
increasing  knowledge,  its  organization,  and  the  conditions  for  its
applicability.  How are  these new discoveries  made? Empirical  induction
(generalizing from experimental and other observations), analogy, and  the
criticism  of  existing   theory  all   lead  to   new  conjectures,   new
possibilities.

(ii) Intelligent systems  can make  the applicability  of their  knowledge
explicit by representing that knowledge as condition/action rules (If {\sl
situation}  then  {\sl   appropriate  reaction}).  Such   pattern-directed
inference systems (PDIS) can benefit from a schema representation  (frame,
unit, being, script, etc.),  because this adds  structure which the  rules
can then take advantage of.

(iii) Computers currently present us  with limited cycles, cheap  storage,
and expensive knowledge acquisition.

While the  reader may  wish to  turn immediately  to a  discussion of  our
detailed cognitive economy ideas (Section 3), it is useful to examine  the
above assumptions in more detail first.


\SSEC{Our Model of Intelligence}


Many human cognitive activities, including most of those commonly referred
to as ``requiring intelligence", can be cast as searches, as  explorations
through  a  search  space,  meandering  toward  some  goal.   We  call   a
problem-solver more  intelligent  if  he can  efficiently  work  toward  a
solution even in the  face of an immense  search space and an  ill-defined
goal.  Somehow, he is  imposing more structure on  the problem, and  using
that to search more effectively. How might he do this?

According to our model of human intelligence, he accomplishes his task  by
bringing knowledge  to  bear, knowledge  not  supplied explicitly  in  the
problem statement.  This knowledge can be about problem-solving in general
(e.g., how to recognize and break cultural blocks) or about the  problem's
domain specifically (e.g., which groups of atoms can usually be treated as
superatoms).  It may have been learned  long ago, or generated during  the
early phase of work on the problem.

This implies  that a  problem  solver can  become  more effective  by  (i)
increasing his knowledge, (ii) better organizing his knowledge, and  (iii)
refining his criteria for  deciding when various  pieces of knowledge  are
applicable.  In terms of production  (If/Then) rules, these correspond  to
adding new rules, modifying  the data structure in  which rules are  held,
and modifying the conditions (IF parts) of existing rules.

One very basic assumption we are making is that of {\it continuity}:
the environment does not change its character too radically, too rapidly.
If the program abstracts its experiences into a heuristic H, it may be
able to analyze past scenarios and prove that possessing and obeying H
would definitely have led it to its current state of knowledge quicker;
but its only justification for believing that H will help it in the {\it
future} is the empirical evidence for the continuity of the environment.
Learning can be translated into action only if the structure of the domain
remains consistent, similar to the way it was while the learning was going
on.


How is new knowledge discovered? One  route is that of {\it  abstraction}:
condensing many experiences into a  heuristic which, in hindsight, we  see
would have greatly aided us in the  past, which would have speeded up  our
reaching  this   state   in  the   search.    Closely  related   is   {\it
generalization}, often  merely  expanding the  domain  of relevance  of  a
specific bit of knowledge  we already possessed. {\it  Analogy} is one  of
the most useful  techniques; one can  draw parallels not  merely to  other
existing facts and heuristics,  but also to the  {\sl paths} which led  to
their discovery (e.g., even if a result in organic chemistry does not  map
over into the inorganic world, the experiment which led you to that  first
fact may map over into an experiment which {\it will} reveal a useful fact
in inorganic chemistry; even though the analogue of a mathematical theorem
may be false  in some  new domain, the  analogous proof  technique may  be
useful). Another path to discovery is {\it specialization} of more general
knowledge. Finally,  we  must  mention the  process  of  {\it  hypothesis,
criticism, and improved hypothesis}, which is a common and powerful method
of spiralling  in toward  precision.  In  mathematics [Lakatos]  and  many
scientific disciplines [Musgrave  & Lakatos],  counterexamples to  current
theories  arise  frequently,  and  their  incorporation  often  demands  a
deepening of the theory, an increase in knowledge.

Experiments such as the AM program [ref] support our assertion that  these
methods can adequately guide even open-ended searches for new  definitions
and conjectures.  But how can an intelligent system be programmed, how can
such systems be organized?


\SSEC{Our Model of Intelligent Program Organization}

The above  remarks  about intelligent  problem  solving apply  equally  to
hardware and wetware alike. Computer programs which are to be  intelligent
must  ultimately  grapple  with   the  tasks  of  knowledge   acquisition,
representation, and refinement. We cannot provide an absolute answer to how
they should be organized, but we  can posit some design constraints  which
have proven useful so far.

A very  general heuristic  in AI  programming is  the following:  If  your
program is  going  to  modify its  own  {\bf  X}, then  make  {\bf  X}  as
separable, clean, and explicit  as possible. In our  case, {\bf X} can  be
instantiated as  "knowledge", or  as  "applicability conditions  for  each
piece of  knowledge".  Thus the  heuristic  advises us  to  represent  our
knowledge in a separate,  clean, explicit form,  say as knowledge  modules
having some fixed structure, and also advises us to keep the applicability
conditions for  each  knowledge  module  separate from  the  rest  of  the
knowledge it contains.

This naturally leads us to  a pattern-directed inference system, in  which
knowledge is broken into  separate modules, each with  an explicit set  of
relevancy tests.  Such systems  arising  in Pittsburgh  may  overemphasize
cleanliness (so-called pure  production systems), while  those arising  in
California may tend to  be a bit  too laid back  (so-called ad hoc  expert
systems), but such variations  should not obscure  their common source  of
power.  The dominant PDIS architecture has knowledge broken into a set  of
condition/action  productions,  rules  of  the  form  IF  {\sl  triggering
situation} THEN {\sl appropriate reaction}.

Having a clean  representation for  rules means having  a clean,  precise,
elegant language in which to express them.  By structuring the  conceptual
knowledge of  the system,  by partitioning  each module's  knowledge  into
several categories, a rule can  condense an entire cumbersome  description
into  a  single  pointer.   The  popular  schematized  representations  of
knowledge (scripts for episodes, frames for static situations, beings  for
specialists, units for  everything) enable  a  rule like ``If  there are  no
known methods specific to finding new examples of prime numbers,  Then...''
to have its condition coded as a simple null-test on the To-get subslot of
the Examples slot of the schema for Prime Numbers.  By a judicious  choice
of slot  types and  subslot  types, the  system  builder can  reduce  most
triggering conditions  to  such  quick  checks on  the  state  of  various
(subslots of) slots of schemata.

Additional knowledge is required to efficiently locate all the rules which
{\it might} have their conditions satisfied in a given situation, and also
to decide which rules to actually  fire (obey) among those whose IF  parts
have actually triggered (been satisfied).

The location of  potentially-relevant rules is  facilitated by  organizing
the  rules   into  some   structure.   For   example,  AM [ref]   organized
mathematical concepts into a generalization/specialization hierarchy,  and
tied each rule  to its most  relevant concept. To  find rules  potentially
relevant to concept C,  AM then needed  only to look  at the rules  tacked
onto C, C's generalizations,  {\it their} generalizations,  and so on.  By
inheritability, any  rule  relevant  to judging  Objects  in  general  was
(potentially) relevant  to judging  an Abelian  group; any  technique  for
estimating the value of any function was relevant to estimating the  value
of Ackerman's function.
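
To make this rippling concrete, here is a minimal sketch in modern Common Lisp
(AM itself was written in Interlisp); the concept names, the GENL link, and the
RULES attribute are illustrative simplifications, not AM's actual structures.

;; Each concept is a symbol whose property list holds its immediate
;; generalization (GENL) and the rules tacked directly onto it (RULES).
(setf (get 'abelian-group 'genl)  'group
      (get 'group 'genl)          'object
      (get 'abelian-group 'rules) '(check-commutativity)
      (get 'group 'rules)         '(check-associativity)
      (get 'object 'rules)        '(judge-interestingness))

(defun potentially-relevant-rules (concept)
  ;; Ripple up the GENL chain from CONCEPT, collecting every rule
  ;; tacked onto it or onto any of its generalizations.
  (loop for c = concept then (get c 'genl)
        while c
        append (get c 'rules)))

;; (potentially-relevant-rules 'abelian-group)
;; ==> (CHECK-COMMUTATIVITY CHECK-ASSOCIATIVITY JUDGE-INTERESTINGNESS)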

This requires an explicit focusing of attention, a selection of a ``current
topic'' C from which the above rippling proceeds. We favor the  maintenance
of a separate  data structure  for this purpose,  an agenda  of topics  to
consider, of subtasks  to work  on. Knowledge may  be brought  to bear  to
select a topic,  then the above  quest for potentially  relevant rules  is
carried out, then they are examined  to find the truly relevant  satisfied
set, and finally some subset of those are fired off.



\SSEC{Our Model of (Present) Computers}


Since we  are  considering the  problem  of building  computer  models  of
intelligent  behavior,   many   of   our   assumptions   deal   with   the
characteristics of present-day computers. They are the symbol manipulators
we use, but were not designed for that general purpose.

[RICK:  FILL THIS SECTION OUT; 
	2 BEAUTIFUL PARAGRAPHS WILL DO IT. BASIC IDEA WAS:
	Computers currently present us with limited cycles, cheap storage,
	uniprocessing, and expensive knowledge acquisition.]


\NSECP{SELF-MONITORING \& MODIFICATION: INTELLIGENT LEARNING}

  \SSEC{DYNAMICALLY MONITORING ITS TASK ENVIRONMENT}

Rather than working on a problem steadily, it may be better to expend some
time {\it learning} about that problem, its domain, or problem solving in
general. Note this encompasses traditional education (being
told), empirical research (observing),
and theoretical exploration (predicting).
While this appears almost tautologous vis-a-vis human
strategies (especially to those of us who have
rationalized spending twenty years ``in school''), very few programs to date
have employed any of these forms of learning to aid them in their operation.

Some programs (and wish-lists for program capabilities) have included learning
more about the problem being worked on;\foo{ McCarthy's advice-taker and
Balzer's dialogue system (to get airline reservation programs written) would
be told, Gelernter's theorem-prover (employing diagrams as analogic
models) could observe, Lenat's AM
could observe and predict.}
we are stressing learning about the
ways the program is being currently employed.

\yskip

\noindent {\bf1.  Learning by being told.} \ 
There are many situations in which a user could provide advice to a program.
Some such advice would be task-specific (e.g., ``Integration by
parts won't work on this problem''; ``This is a very hard
problem''), some will have to do with the next several tasks to be submitted
(e.g.,  ``Problems will be supplied in order of
expected difficulty''), and some
will apply permanently to the nature of the program's functioning
(e.g., ``Answers must be produced in real time''; ``Hard problems will be
starred'').  The last kind of knowledge can be reflected permanently in
the design of the program, in its code.
The first kind of knowledge can be input as part of the statement of the
problem (``don't use integration by parts''), as a list of constraints or
preferences. The middle kind of knowledge is perhaps the most difficult to
incorporate.  We all have experienced conscious shiftings during interactions
with a  program, a switch from one phase to another, yet have had no way to
communicate this knowledge to the program.

This paragraph is a plea to consider the utility of such intermediate-level
advice, and to
provide programs the opportunity to accept and use it.
A fixed (static) approach would be to consider various modes
(Debug, Demo, Production, Input, ...) and permanently build into the code
all the required changes for each mode. A flexible (dynamic) approach would
be to provide a larger language in which the user could describe his current
mode of usage of the program (``USAGE of KNOWLEDGE-BASE
will-be INCREASED VERY-MUCH''), and which would then be 
translated by the program into
changes in its data structures, algorithms, verbosity, core management
strategies, etc. (in this case, heavy usage of the knowledge base might
cause the program to swap out its compiler, optimizers, and other code,
enabling it to hold many more knowledge chunks at a time in core.)

\yskip

\noindent {\bf2.  Learning by observing.} \ 

Spending a little time/space reconfiguring to reflect the current state
of the world (the nature of the recent and succeeding inputs) seems
worthwhile; the problem is how to know what the abnormal nature of the
environment is. In the last paragraph, we discussed the case where the
user explicitly tells the program. But this is often impossible (as when
the
user is unaware of the irregularity, or when the ``user'' is nature), and
almost always a nuisance (learning the language in which to express such
advice, remembering to do it, taking the time to do it). 

The next step seems to be to allow the program to {\it infer} such facts
about its run-time environment from observations.
What kinds of self-monitoring can be usefully employed?

The method which comes to mind first, perhaps, is to save a complete
record of everything that ever happened to this program: all the inputs
it received (when, from whom), function calls and accesses it made internally,
paths it chose/rejected (and why),
and outputs it generated.  The program could in principle
deduce, from such an exhaustive
history-list, patterns such as ``very light usage of
compiler''; ``SQRT is always called
on the result of SUM-OF-SQUARES''.

The spatial cost of such an archive, and the temporal cost of the inference 
procedures
in extracting useful patterns, are prohibitive even for dreamers
such as we. The more economical approach is to analyze in advance certain
features we would wish to look for in a complete history record, and then
to have the program record just enough dynamically to gather and maintain
those features. Some of these will be domain-independent, will
be commonly useful:
Frequency of calls on various functions, requests for subtasks, etc.;
Patterns in the arguments to calls on certain functions and inputs to the program;
Patterns in the sequence of calls on functions;
Patterns in functions' results (e.g., F34 always seems to
return a User-Name).
Some of the features will be domain-dependent specializations of the
preceding features (``Average degree of input equation'' is a special
``pattern in the input.'')

\yskip

\noindent {\bf3.  Learning by predicting.} \ 

The theoretician and the engineer share
a powerful technique: manipulating
a model of the desired phenomena, and thereby generating predictions.  
This is a third route to learning, to acquisition of new knowledge.
How might programs utilize it?
What sorts of theories, of models, can be employed to generate new
information?

{\it Models of the program's users} can be valuable. This includes
maintaining and extending profiles of individual users and whole
groups. When a user opens a dialogue with Eurisko with the word
``Let'', it is a safe bet that he's a mathematician.  Later, when he
uses the word ``function'', his membership in that group makes it easy
to determine which meaning he has in mind
(instead of, say, what a physiologist or a computer scientist might mean
by ``function.'')
When the program wishes to
mention the square-root of -1, it uses its Mathematician model to
choose an appropriate notation ({\it i} instead of, say, {\it j} for
engineers).  Models can be arranged in a hierarchy, with inheritance
of features from more general groups to their subsets, a lattice
ultimately terminating in individuals' models.  User inputs would
be recognized by various models, which would then hypothesize membership
in that group (or explicate some future conditions which would confirm or
deny such inclusion).  Once triggered in this way, the groups which the
user appeared to belong to (and their generalizations) would actively
filter his inputs and outputs, could affect the program's actions (e.g.,
changing the worth of the ``Proof'' concept, which in turn would change
the priority rating of tasks on an AM-like agenda, which would alter
the program's choice of what to work on next.)
New user models can be created either top-down (i.e., when a model has
too many instances, or when it's proven to be very valuable, try to split
it into several specializations, each having a share of the general model's
instances) or bottom-up (i.e., notice that a group of individuals share a
set of common characteristics, and create a new user group out of them).
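
The group-to-individual inheritance just described is easy to sketch; this
Common Lisp fragment (the group names and the notation property are wholly
illustrative) ripples up the model hierarchy exactly as slots ripple up
GENL links:

(setf (get 'mathematician 'genl) 'user
      (get 'engineer      'genl) 'user
      (get 'mathematician 'sqrt-minus-one-notation) '|i|
      (get 'engineer      'sqrt-minus-one-notation) '|j|
      (get 'merle         'genl) 'mathematician)

(defun model-value (individual property)
  ;; Look for PROPERTY on INDIVIDUAL's own model, then on the models
  ;; of ever more general groups, returning the first value found.
  (loop for m = individual then (get m 'genl)
        while m
        thereis (get m property)))

;; (model-value 'merle 'sqrt-minus-one-notation) ==> |i| -- nothing is
;; recorded on MERLE's own model, so the answer is inherited from the
;; MATHEMATICIAN group model; an ENGINEER would get |j| instead.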

The preceding kind of prediction is based on a theory of user types, embodied
in a set of ``typical representative'' models.  The same source of power can
be used in many ways: stereotyping of events (i.e., scripts), stereotyping
of static complexes (i.e., frames), stereotyping of communities (i.e.,
actors, beings, contract nets).  In all these cases, the source of power
lies in quick recognition of probable membership in (applicability of) some
schema, after which all the rest of the schema
(plus knowledge about all its generalizations) is presumed to be relevant.
Programs might build up and utilize scenarios of how problems are solved
typically in their domain.  The building-up could be top-down via specialization
or bottom-up via generalization, or even ``sideways'' via analogy. The 
models would be used for prediction (recognize relevant models and activate
them, have them hypothesize probable future situations).
Together, the models would form a theory of how problems are
solved in that domain.


  \SSEC{DYNAMICALLY MODIFYING ITSELF}

Tasks can be specified anywhere from a pure ``What'' (desired
input/output behavior) to a pure ``How'' (precise algorithms, data structures,
representations, formats).  It has been observed [Feigenbaum ref] that one
of the goals of AI research is to push our ability from the How end toward
the What end of that spectrum.  This often takes the form of designing a new
higher-level language than any that hitherto existed, one which takes over
more of the program implementation.  We can see this as far back as
assemblers replacing machine language, and as recently as KRL and the 
Molgen Units Package which extend Interlisp.  

Assuming, then, that the user
desires merely to specify What, an intelligent language (automatic code
synthesizer) must decide How.  Upon what can it base its design?
Previous efforts have employed knowledge about program transformations
[Burstall & Darlington][Balzer], programming techniques [Green et al],
task-specific knowledge [Green & Barstow sorting], and
 -- if the input language is at all complex -- information to enable
the successful translation of the input specification into a sufficiently
precise specification of What [Lenat PUP6]. 
A couple of recent efforts have begun to guide the decision-making during code
synthesis by algorithmic analysis [Kant] or even by aesthetics
[Lenat AM][Kevin Kahn IJCAI paper].  But all these methods are 
{\it static}: they feed off a fixed base of initially-supplied knowledge,
heuristics, constraints, suggestions, facts.  The user supplies a
specification for the desired program, and the knowledge base is employed
to produce that target. The specification may be incremental, and some of
the synthesis may be by dint of synthesizing code fragments, running them,
trapping errors, and then debugging them.  

There still remains another potentially valuable
source of information: 
the history of the program as it is run.  The last section sketched ways
in which dynamic program performance could be self-monitored.
This section draws attention to the possibility of applying such usage
data to modifying -- or in the extreme re-synthesizing -- the program.

For example, suppose a simulation program were specified, and on the basis of
the user specification an event-driven system appeared to be the best design
choice.  During subsequent usage, the majority of questions refer to
precise times of event occurrences. If this had been part of the initial
specification, the design decision would have been made differently, say in
favor of a synchronous clock-driven world.  It would be no less useful a
change in the system even at this late date.  But the environment might keep
changing in various ways, slowly, perhaps even cyclically.
The attractiveness of having a system which could automatically adapt
itself to its current runtime environment is apparent.

The simplest kind of adaptive abilities along this line would be to choose
a data structure, a representation, an algorithm, out of a set of well-known
choices
-- and to reserve the right and the ability to change that decision later.
One could gather knowledge about the relative strengths and weaknesses 
of each choice, much as Kant's system actually possesses. But this knowledge
could be in the form of rules which ask not only about static data, but which
ask about the runtime environment of the program as well. E.g., ``If
one conjunct appears to be false more often than any other, make it the first
one tested in the conjunction''; ``If each person will use the program only
once, briefly, it is not worth building a model of each such user'';
``A function that is used frequently and changed rarely should be
compiled.''
Yes, such queries could be made initially, during program synthesis, but
our suggestion of automating the gathering of such data is a step further
than that away from How, toward What, toward artificial intelligence.
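
The first of those rules is easily mechanized. Here is a hedged sketch in
Common Lisp: a per-conjunct failure count is maintained, and each conjunction
is tested with the most-often-false conjunct first, so that it fails as early
as possible. The bookkeeping scheme is ours, for illustration; a real system
would fold it into the monitoring machinery of Section 3.1.

(defvar *failure-counts* (make-hash-table))  ; predicate -> failures observed

(defun test-conjunct (conjunct datum)
  ;; Apply the predicate CONJUNCT to DATUM, counting each failure.
  (or (funcall conjunct datum)
      (progn (incf (gethash conjunct *failure-counts* 0)) nil)))

(defun adaptive-and (conjuncts datum)
  ;; Test CONJUNCTS on DATUM in descending order of observed failures.
  (every (lambda (c) (test-conjunct c datum))
         (sort (copy-list conjuncts) #'>
               :key (lambda (c) (gethash c *failure-counts* 0)))))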

There is another possibility, much more exotic than the previous one,
which deserves attention: the discovery -- in ``real time" -- of new
types of data structures and representations which would be useful
for the program at hand, in its current environment.  

Defining a new data structure is not as difficult as it may first appear:
a data structure can be thought of abstractly in terms of a set of operations
one wants to be able to perform on it (e.g., First, Last, Next, Previous for
lists).  The set of operations need not be minimal, and need not concern itself
with which operations are most important, etc.  The runtime usage of such
a data structure can be sampled, and that can guide the search for an
ever more appropriate implementation (e.g., in one usage, it might be
very useful for Last to be efficient, and TCONCing would be used; in another
environment, it might be useful for Next and Previous to be efficient, and
a doubly-linked implementation would be best).  Dave Barstow is currently
researching issues such as these.
The two points here are (i) the implementation of a data structure
may depend upon how it is commonly used in a particular spot in a program,
by a particular user, and (ii) a new kind of data structure may be defined
abstractly merely by selecting a set of operations; this choice, too, can
be made by examining the runtime data as it accumulates.
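
Point (i) can be sketched directly: define the sequence abstractly by its
operations, dispatch through a record of those operations, and let usage
counts (gathered by a monitor like that of Section 3.1) decide which record
to plug in. The code below is illustrative Common Lisp, not Barstow's system.

(defstruct seq-ops first-el last-el add-last)

;; One candidate implementation: a plain singly linked list, where
;; FIRST-EL is cheap but LAST-EL and ADD-LAST each cost O(n).
(defparameter *singly-linked*
  (make-seq-ops :first-el #'first
                :last-el  (lambda (s) (car (last s)))
                :add-last (lambda (s x) (append s (list x)))))

(defun seq-op (impl operation &rest args)
  ;; Dispatch OPERATION (an accessor like #'SEQ-OPS-LAST-EL) through
  ;; the currently chosen implementation IMPL.
  (apply (funcall operation impl) args))

;; (seq-op *singly-linked* #'seq-ops-last-el '(a b c)) ==> C
;; If monitoring showed LAST-EL dominating, the program could swap in a
;; TCONC-style or doubly linked SEQ-OPS without touching any caller.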

Defining a new representation is not quite so neat. In fact, humans have only
managed to find a handful of representations to date. In lieu of any constructive
suggestions along that line, we shall focus on {\it extending} a given representation.
For a schematized (frame with slots) representation, this would be the definition
of new types of slots.  The Eurisko program has this capability, and we shall
briefly describe how this mechanism functions.

First, we isolated a few major ways in which new slots are defined from old
ones: {\bf STAR}  (e.g., Ancestor $=↓{df}$ Parent$↑*$ which means a Parent of a
Parent of... of a Parent of me; the general case should be apparent);
{\bf UNION} (e.g., Parent $=↓{df}$ Father $\union$ Mother); {\bf COMPOSITION}
(e.g., First-Cousin $=↓{df}$ Child$\circ$Sibling$\circ$Parent; i.e., any
child of any sibling of either of my parents); {\bf DIFFERENCE}
(e.g., Remote-ancestor $=↓{df}$ Ancestor $-$ Parent).
These four operators which define new types of slots 
($*,\ \union,\ \circ,\ -$) are called {\it slot-combiners}.
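
Functionally, the slot-combiners are tiny higher-order operators. The sketch
below treats a slot as a Common Lisp function from a concept to its value
(single-valued for Star, Union, and Composition; list-valued for Difference);
it illustrates the idea and is not Eurisko's code.

(defun star (slot)
  ;; SLOT applied zero or more times: e.g. Ancestor = Parent*.
  (labels ((walk (c) (when c (cons c (walk (funcall slot c))))))
    #'walk))

(defun slot-union (s1 s2)
  ;; e.g. Parent = Father U Mother.
  (lambda (c) (remove nil (list (funcall s1 c) (funcall s2 c)))))

(defun slot-compose (s1 s2)
  ;; e.g. Grandparent = Parent o Parent (pairwise composition).
  (lambda (c) (funcall s1 (funcall s2 c))))

(defun slot-difference (all some)
  ;; e.g. Remote-Ancestor = Ancestor - Parent (both list-valued here).
  (lambda (c) (set-difference (funcall all c) (funcall some c))))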

Next, we add to AM's knowledge base a concept for each type of slot, and
a concept for each type of slot-combiner. Figure n shows some of the
Generalizations and Star concepts. Note in particular the way in which
Generalizations is defined as Genl$↑*$ -- i.e., immediate generalizations,
{\it their} immediate generalizations, and so on. 

[2 concepts]

Finally, we modify the way in which slot entries are accessed. The
traditional way in which (GET PRIMES GENERALIZATIONS) would have worked was
to examine the property list of PRIMES, looking for the attribute
GENERALIZATIONS; if found, the entries listed there would be returned.
If, as in Fig. n+1, there is no such attribute, the call upon GET will
return NIL (the empty list).

[network of concepts from primes up]

We modify the way in which (GET C F) operates. In case the F attribute of
C has no entries (or doesn't exist),  we examine the definition of F
and -- if one exists -- try to use it to compute the entries that could
legally be stored on the F attribute of C.  More formally, before quitting
we try to (GET F DEFN), and if that succeeds we apply it to C.
Let's go through the example of (GET PRIMES GENERALIZATIONS). As we
see in Fig n+1 above, there are none recorded. So GET now calls itself
recursively; our original call is replaced by
((GET GENERALIZATIONS DEFN)  PRIMES).
But as Fig. n shows, there is no slot labelled DEFN on the concept for
GENERALIZATIONS. So we recur one more time. By now our call has become
(((GET DEFN DEFN) GENERALIZATIONS) PRIMES).
Here is part of the DEFN concept:

[Defn concept]

Luckily, it does have a DEFN slot, so we end the recursion.
Applying the entry stored there to the argument ``GENERALIZATIONS'',
we see our original call becoming transformed into
(((GET (GET GENERALIZATIONS SLOT-COMBINER) DEFN)
  (GET GENERALIZATIONS BUILT-FROM))
 PRIMES)

The slot-combiner of Generalizations is ``Star", and the argument
(old slot) which it is built from is ``Genl".  So this turns into
(((GET STAR DEFN) GENL) PRIMES).
We see above that STAR has a Defn entry; it turns the preceding call into
(($\lambda$ (c) (CONS c (self (GET c GENL))))   PRIMES). 
This function first calls for (GET PRIMES GENL), which is NUMBERS, then calls
itself on NUMBERS; that in turn calls for (GET NUMBERS GENL), which is
OBJECTS, and calls itself recursively on OBJECTS; that calls for
(GET OBJECTS GENL), which is ANYTHING, and the next recursive call terminates
when it is discovered that ANYTHING has no GENL (no immediate generalization.)
The list CONStructed and returned is thus (PRIMES NUMBERS OBJECTS ANYTHING).
These four items are the legal entries for the GENERALIZATIONS slot of
PRIMES, according to the definition of GENERALIZATIONS.  
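
The whole mechanism fits in a few lines. Here is a hedged Common Lisp
rendering, with property lists standing in for concepts; in Eurisko the
recursion bottoms out at DEFN's own DEFN slot, whereas for brevity we store
the definition of GENERALIZATIONS directly.

(defun vget (concept slot)
  ;; GET with virtual slots: if SLOT is not physically present on
  ;; CONCEPT, fetch SLOT's own definition and apply it to CONCEPT.
  (or (get concept slot)
      (let ((defn (get slot 'defn)))   ; full Eurisko: (vget slot 'defn)
        (when defn (funcall defn concept)))))

;; The GENL chain of the running example:
(setf (get 'primes  'genl) 'numbers
      (get 'numbers 'genl) 'objects
      (get 'objects 'genl) 'anything)

;; GENERALIZATIONS is virtual: its definition is Genl* (the Star of Genl).
(setf (get 'generalizations 'defn)
      (labels ((walk (c) (when c (cons c (walk (get c 'genl))))))
        #'walk))

;; (vget 'primes 'generalizations) ==> (PRIMES NUMBERS OBJECTS ANYTHING)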

Notationally there is no distinction between slots which are ``primitive"
(values actually stored as attributes on a property list) and slots which
are ``virtual" (values must be computed using the slot's definition).
A heuristic might refer to the Generalizations of Primes without knowing,
or caring, whether that initiated a single access or a dizzy chase.

To define a new kind of slot, then, one need merely specify one of the
slot-combiners, and list the old pre-existing slots from which it is built.
Thus we might define a new slot, by creating a new concept (calling it,
say, ``DG''), filling its
Slot-combiner slot with the entry ``Difference'', filling its 
``Built-from'' slot with the arguments ``Generalizations Genl''.  This would
be a new kind of slot, one which returned all generalizations of a concept
except its immediate generalizations; the call (GET PRIMES DG) would return
(PRIMES OBJECTS ANYTHING).   

It is only a small extension to see how new
kinds of slot-combiners can be defined. For instance, one which took
two old slotnames as arguments, f and g, and defined a new slot which was
f$↑*\circ$g$\circ$f$↑*$, would be extremely useful
(the Examples slot is defined as Spec$↑*\circ$Immed-Exs$\circ$Spec$↑*$, e.g.)

Here is another example of how the expanded GET works:
we consider an access on the PARENTS slot, when that slot is not primitive,
but rather is defined as the union of slots for father and mother.

(GET MERLE PARENTS)

((GET PARENTS DEFN) MERLE)

(((GET DEFN DEFN) PARENTS) MERLE)

((($\lambda$ (s) ((GET (GET s SLOT-COMBINER) DEFN) (GET s BUILT-FROM))) PARENTS) MERLE)

(((GET (GET PARENTS SLOT-COMBINER) DEFN) (GET PARENTS BUILT-FROM)) MERLE)

(((GET UNION DEFN) FATHER MOTHER) MERLE)

((($\lambda$ (s1 s2) ($\lambda$ (c) (LIST (GET c s1) (GET c s2)))) FATHER MOTHER)  MERLE)

(($\lambda$ (c) (LIST (GET c FATHER) (GET c MOTHER)))  MERLE)

(LIST (GET MERLE FATHER) (GET MERLE MOTHER))

(LIST SID BETTY)

(SID BETTY)

\yskip

Besides the issue of how a new slot {\it can} be defined, which is what
we have been illustrating, we should consider how 
a program is to know {\it when/how} to define a new one.
The new type of slot might be defined for purely
exploratory reasons (e.g., it's aesthetic to define ``first cousins'': the
specializations of the generalizations of a concept).  The slot's definition
might be based soundly upon need -- or the absence of need
(e.g., from the kind of monitoring
described above, 
notice that many concepts have a large number of entries for their F slot,
thence
conclude that the F slot should be specialized, that
several new slots should be created which cover it, which each have
fewer entries on the average than F had;
e.g., the Examples slot is heavily used, so new slots like Boundary-examples
and Typical-examples would be worth defining).  The new kind of slot-combiner
defined in the previous paragraph could be born of the fact that those
slots which are most useful in the system (Isas, Examples) share that common
form; or it could be born purely of symmetry, and the existing good
uses of it would be noticed only in retrospect.

We close this section with a further exotic possibility.
If the task domain of the system is the
exploration of (a domain isomorphic to) programming, some new
``theorem" might occasionally be discovered which can be incorporated into a
data structure or
algorithm (or even representation) to speed it up or supplant it.

\NSECP{CACHING:  INTELLIGENT REDUNDANCY}

After computing some result, a program typically puts it to some use and
forgets the result itself.  If needed again, it is recomputed.
Hardware designers have long recognized (both theoretically [ref] and
empirically [ref Cray?])
the tremendous gain in efficiency afforded by {\it caching}: recording
the result, at least temporarily, in case it is called for again soon.
If that occurs, then no time need be spent recomputing it.

Transferring that notion from hardware to software, and generalizing it
slightly, we have the philosophy we shall call {\it software caching}:
modifying memory to save computed results to speed subsequent accesses.

The simplest case of this is: just after computing F(x), for some
function F called with argument x, remember the value that was returned.
Whenever F is called, we first check to see if a cached value is stored
(e.g., by associative retrieval on function/argument/value triples, by
scanning a list of function/argument pairs for which values are
recorded, or by hashing F and x).  If no cached value is found,
F(x) is computed and cached for future usage.
If a cached value is found, then we must decide whether to accept it,
or else to ignore it, recompute it, and store the new value there.
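
In modern terms this policy is memoization. A minimal Common Lisp sketch,
hashing on the argument (one of the retrieval options just listed), might
read as follows; the cache-acceptance question is taken up later in this
section.

(defun memoize (fn)
  ;; Return a caching version of the one-argument function FN.
  (let ((cache (make-hash-table :test #'equal)))
    (lambda (x)
      (multiple-value-bind (value found) (gethash x cache)
        (if found
            value
            (setf (gethash x cache) (funcall fn x)))))))

;; (setf (symbol-function 'f) (memoize #'f)) installs the cache for F.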

But even this extravagant record-keeping is not complete: we have not saved
information about the {\it process} of computing F(x), the path that was
followed, the traps, the blind
alleys, the overall type of successful approach, etc.
While much more costly to record than simply the value returned by F,
this type of information can often be of extreme usefulness, especially
for the purposes of generalizing, of monitoring the runtime environment
of the program, of leading to new and better algorithms and data structures
(see the preceding section).
The analogue in chess, e.g., was first hinted at by Newell, Shaw, and Simon
in their early chess paper, then elaborated by Berliner
in his thesis [CAPS]: the idea that you should abstract out a
description of a path through the search tree, so
that the remainder of the search could partake of it.
If one path leads to a discovered check against you, so will many others.
You would be wiser to keep that threat in mind, instead of discarding all
your state information and blindly backtracking.

[[Rick and Phil:  are either of you up to working out some
rough calculations which demonstrate
the extra cost and benefits in some situations???
Choose situations that we do (or should!) use throughout
the paper as running examples.]]

Consider the way in which (GET PRIMES GENERALIZATIONS) was computed by
Eurisko, as explained in the last section.  Several function calls were
required, and a fair amount of cpu time was expended. It would be quite
economical to store the value once computed, to save time in the future.
Eurisko simply stores (PRIMES NUMBERS OBJECTS ANYTHING) as the value of the
GENERALIZATIONS attribute of PRIMES.  When the call on GET is reissued, it
tries to access this very spot.  Though it failed previously, it now would
succeed, and it would return the cached list almost instantly.  No computing
would be done, the concepts of STAR, DEFN, and GENERALIZATIONS needn't be in
core now, and no space (even temporary space) would be required.  We might
prefer to enclose cached values within brackets, to indicate their nonprimitive
status.

We saw earlier that notationally (e.g., to a rule) there was no need
to distinguish an access of a primitively-stored slot from an access of a
virtual, computed slot.  Now we see that once the value is computed and
cached away, there is no telling it from a primitive one.
Thus, with but a small one-time cost, our program runs as quickly as if
all slots were primitive.

Note that in computing Generalizations of Primes, it was necessary to call for
(GET GENERALIZATIONS DEFN), and the value of this virtual slot was also
computed.  The Eurisko policy is to cache this value also. It is useful 
because, when a request such as (GET DUCK GENERALIZATIONS) is received, it
would otherwise have to be recomputed all over again.  The definitions of
slots are very slowly -- if ever -- changing, hence the recomputation of
Defn of Generalizations is quite a rare event.  Caching that value must be
cost-effective in the long run.

In general, we see that caching a value for slot F of concept C is
most applicable when the value can be expected to vary quite infrequently.
In Eurisko, this gives us the flexibility to redefine a slot's definition
if we wish, but (since this will be rare) the program will run just about as
efficiently as if that capability
were absent.  This is analogous to compiling: so long as the definition of
a function doesn't change too often, it's undisputedly worthwhile to compile
it.   The caching of DEFN and GENERALIZATION slots is not in any way special;
any virtual slot can be cached.

The careful reader may have spotted a difficulty in our caching policy:
what happens if the value {\it does} change?  One way to check for this
would be to recompute the value, but of course if this is done too often
it negates the benefits of caching.  Eurisko faces this problem by adding
some extra arguments to GET, parameters which describe the
cost/benefit curve
(for each amount of resources expended and each
degree of precision of the result, this curve specifies precisely how
acceptable that resource consumption / accuracy of
solution pair is.)  One extreme of this is to provide
an ironclad limit on resources [ref. Bobrow's resource-limited
computation], and say ``Do the best you can
within these time/space/... limits.''  Another extreme
(most common in present-day programs) is to specify
an ironclad goal criterion (``Find an answer which is at
least this good, no matter how long it takes to get.'')
We are making the reader aware of the space in between
the two extremes.

To bring us back from the lofty heights of generality, let's see in more
detail how Eurisko does this. First, we linearized this space:  we picked a
few telling parameters of it, and said that a description
was merely a list of values for these parameters.
When a call on GET was executed, and a cached value was encountered, a
formula relating these extra arguments to GET would then determine whether
to accept that cache, or to ignore it and recompute the value.
Eurisko's parameters (extra arguments to GET) are:
cpu time to expend, space to expend, whether the user can be asked about
this, how fresh the cache must be, minimum amount of resources which must have
been expended at the time the cache was written.

So (GET PRIMES GENERALIZATIONS 200 40 NIL 3 0) would allow any cached
value to be accepted if it appeared that recomputing would take more than
200 milliseconds, or would use up more than 40 list cells, or if the value
had been put there less than three Cycles ago (in Eurisko, a Cycle is the
time it takes to work on the top-rated task on the top-rated agenda).
Otherwise, the cache would be ignored, a newer value would be computed, and
it would be stored away.  With it would go the information that (i) it was
a cached value, not a primitive one, (ii) how long it took to compute,
(iii) how many list cells it used up in computing this value, (iv) the current
Cycle (number of tasks worked on so far). The above call on GET might result
in this value being stored on the GENERALIZATIONS slot of PRIMES:
(*cache*  (PRIMES NUMBERS OBJECTS ANYTHING)  54  6  9).
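
The acceptance test itself then reduces to a few comparisons. In sketch form
(Common Lisp; the record layout is exactly the one just shown, the ask-user
argument is omitted for brevity, and *CURRENT-CYCLE* is assumed to be
maintained by the agenda loop):

(defvar *current-cycle* 12)   ; assumed updated once per task

(defun accept-cache-p (record time-limit space-limit freshness min-resources)
  ;; May RECORD, of the form (*cache* value cpu-ms cells cycle), be reused?
  (destructuring-bind (marker value cpu-ms cells cycle) record
    (declare (ignore marker value))
    (and (>= cpu-ms min-resources)      ; it was computed carefully enough, and
         (or (> cpu-ms time-limit)      ; recomputing would cost too much time,
             (> cells space-limit)      ; or too much space,
             (< (- *current-cycle* cycle) freshness)))))  ; or it is still fresh

;; With the call shown above, (GET PRIMES GENERALIZATIONS 200 40 NIL 3 0):
;; (accept-cache-p '(*cache* (primes numbers objects anything) 54 6 9)
;;                 200 40 3 0)  ==>  NIL at cycle 12: recomputing is cheap
;; (54 ms, 6 cells) and the cache is a full three Cycles old, so it would
;; be ignored and recomputed, just as described.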


[[	Contrast with psychological conjectures of cognitive economy
		(e.g., Collins&Quillian, KRL, ...).  More like $HR↑2$ Plasticity
		model of storing all retrieved paths as direct links
]]

[[I can guess what this means, namely the horse isa animal being
better than horse isa mammal, but you (Rick?) should do this section.]]
[[Mention the psych. studies that show that people do develop direct association
between words that are co-retrieved often. Thus dog and animal are closer
than dog and mammal, even tho dog-->mammal--->animal in the hierarchy.]]


The caching process involves storage and updating. For both of
these aspects, we can discuss details of when, why, how, and what.
The decisions that arise are made with the guidance of heuristic rules,
and below we present many of those heuristics.  We have omitted many of the
very general ``common-sense" heuristics, and those which deal with details
of Lisp programming. We also have specified the rules in English, rather than
in Lisp; we have freely used expressions like ``recently", although in any
encoding that would have to be made much more precise.
Besides storing and updating, a common decision to be made is whether to
accept a cache or not; this has been discussed above and will not be duplicated
in the heuristics below.

Finally, note that these heuristics are not {\it always} relevant; they are
applicable for many AI programs, running on PDP-10-sized computers, running
Interlisp, with a human user watching and interacting.
There would be other rules for other machine architectures and languages.
In other words, one could say that we have only partially specified the
IF-part (left hand sides) of the following heuristic rules. An alternative
would be to build up an entire knowledge base of heuristics just to determine
the applicability, the domains of relevance, 
of the following (as a function of machine and language.)


[[the following will be in 3 boxes or figures]]

STORAGE PRINCIPLES
------------------

       When (and when {\it not}) to

		Every time you have to call a lower order function.
		You called a function and it took a long time to return.
		The value returned has not changed since the last time you called it.
		The amount of time to recompute the value would be very high.
			E.g., after a lucky accident, where a repeat might
			be very costly or even fail entirely to duplicate
			your recent success at getting a solution.
		If (freq of usage) x (avg cost to recompute) / (freq of change)
			is high, then it pays off (see the sketch at the end of this box).
		If the value is so critical that the cost-benefit criteria
			(eg the resource limits are enormous) always favor
			recomputation rather than chancing a blunder,
			Then don't cache.
		If the function evaluates as fast as the caching mechanism itself,
		  	or if space is too tight, don't cache.
		F is called with the same argument x very often.


      Why

		Temporal cost of recomputing is higher than spatial cost of storage.
		Related to ``When" rules above; namely: in those situations, it's
			cost-effective to store a cache.
		Analogy to utility of hardware caching.
		Ability to utilize cached values as usage data to induce upon.

      Where

		The value to be cached should have a precise storage place.
		One extreme is an assertional database, with {\it no} structure.
		We prefer to add slot and subslot structuring, with some
		domain-specific tinkering of those two sets (especially 
		the former, the set of slots.)
		Thus an extreme example of prime numbers should be stored
		on a slot called extreme-examples, on a schema called primes.
		At access time, we look in the precise spot(s) where X would
		be, and if it isn't there, we compute and cache it right there.
		At worst, there should be a precise {\it set} of places where X
		might be, a well-defined path to search down for X (e.g., 
		a conjecture about prime numbers would either be an Example of
		Conjectures, or a Conjecture of Primes; if it is found in neither
		of those places, it need not be searched for elsewhere.)

      How

		Before (re)computing F(x), the program presumably searched in the
			``right place'' (see ``Where'' above) to find a cache of
			F(x) if one existed. At that time, set a pointer to that
			place, and when the value is found, place it there.
		Called functions might suggest how to cache their value in higher
		    calling caches 
			E.g., ``my value changes often so compile my definition
			and cache that, but don't waste resources caching this value.''
		Caches should be transparent and discardable (the program should
		    	be able to throw them all away if space is needed).
		If there are multiple places to store it (e.g., you learn that
			X is a PARENT of Y, but PARENT is a virtual slot defined as
			MOTHER $\union$ FATHER; if the cache is to be discardable, then
			we must store X under either Y's MOTHER or FATHER slots.)
			Then here are some strategies one might use:
			  Store it redundantly in several (all possible) places
			  Pick one of the places at random and store it there
			  Look up information about each place, to decide
			  Pick one of the places accurately by reasoning, testing, etc.

      What

		Store the returned value in any case.
		Good policy to store a description of the {\it call} which led to
			this value being computed.  E.g., make sure you
			store the amount of resources allotted, the
			time of day and the relative place 
			in the computation (e.g., task number), whether
			the user was allowed to be queried, etc.
		Along with the call, store data about the actual path to a solution.
			This includes algorithms used, time it took to compute,
			which other caches were used, etc.
		It may be useful to store other intermediate results of the process
			that led to the value's being computed successfully
			(e.g., a trap that was found and may crop up again.)
			This also includes storing some abstract, partially-evaluated
			symbolic expressions which generalize the value and its
			computation.
		If possible, predict (criteria, time-limit, etc.) conditions under
			which this will no longer be current. E.g., ``Depends
			on the definition of Y being known, and on there being 
			no examples of Z known, and...'')
			Include a discussion of the certainty of the result.
		It may be worthwhile to {\it stack} the previous cached values,
		  	enabling the program to determine if (when, how, how often)
			they're changing.
		If you choose not to cache, you may wish to store a marker recording
			-- and hopefully explaining -- that decision.


      How to eliminate caches

		Just reverse the criteria in ``When" (above), to see if it
			is no longer cost-effective to keep the cache.
		Remember that F(x) will occupy varying amounts of space,
			varying with F, x, and time. 
			AM, e.g., eliminated largest cached values first.
		Eliminate least frequently (or least recently) used caches first.
		In general, eliminate the caches that are least cost-effective.
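
Here is the sketch promised in the ``When'' box above: the payoff test
(freq of usage) x (avg cost to recompute) / (freq of change), rendered as a
one-line predicate. The threshold and the units are illustrative only.

(defun worth-caching-p (uses-per-cycle avg-cost-ms changes-per-cycle
                        &optional (threshold 100))
  ;; Cache when expected recomputation work dominates expected staleness.
  (> (/ (* uses-per-cycle avg-cost-ms)
        (max changes-per-cycle 1/1000))   ; guard against division by zero
     threshold))

;; (worth-caching-p 50 20 1/10) ==> T  : heavily used, costly, stable.
;; (worth-caching-p 1 2 5)      ==> NIL: cheap to redo and volatile.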


UPDATING PRINCIPLES
-------------------


       When

		If the current request would have produced a different answer.
		Usually, only do this if the current request will produce a
			distinctly {\it better} answer. For instance:
		If the current request specifies more precision in the answer.
		If the resource allocation is high enough to get a better
			or more current answer than the one cached already.
		Trivially:  If there is no cached answer available.
		Typically: x or the definition of F changed since F(x) was cached.


       Why

		Frame problem: if the value is old, it's hard to know for sure that
			the world hasn't changed since it was computed.
		If new value is computed (either intentionally or accidentally),
			and its freshness (or the other conditions of its call)
			makes it ``better" than the existing cached value.

       How

		Discard the old value, or remember it elsewhere (maybe on disk).
		Can use demon traps that flag the cache as out of date, without
			bothering to recompute the value if not required to.
		If the new value is found, simply store it.
		The user may want to know about the updating.
		Many other cached values may depend upon this one, and they may have
			to be ferreted out and updated also, or at least some
			information should be left posted so they can later determine
			that they may be out of date.
		Usually: discard the whole old value, store just the new one.
		If the program is very smart: When the world changes, try to
			propagate that change through the network of caches,
			changing {\it them}. This is much better than erasing them!

[[Should we give an example of this "propagation" in action (prob. fictitious)??]]


*******************************************************************




Caching, as we have described it, is merely the lowest-level of a tier of
data reduction processes, techniques for compiling hindsight into usable,
{\it more efficient}, more procedural forms.
To see this, consider the progression of ever more sophisticated tasks
which programs perform [see Fig. n]: accessing a datum, calculating a
value from a fast algorithm, deducing a result by searching for a solution
or proof in a well-defined space, 
inducing a result by a very open-ended search process.
Often there is an opportunity to take experiences at one of these levels of
sophistication and compile them into an entity at the next lower level,
one which appears to retain almost the same power.

Consider first the way in which we reduce induction. After
much experience with inducing, we produce a {\it heuristic rule} which
enables the same inference to be done in a much more controlled space,
in a much more deductive way.  Heuristics reduce discovery tasks to mere
(!) symbolic manipulation within a well-defined system (as in AM).

Deduction is reduced similarly. By many painful deductions, we gain knowledge
of the space we are searching, e.g. by deriving many theorems about it.
Those theorems can then be used to vastly simplify our search. Ultimately,
the process becomes efficient calculation: a short, optimal algorithm may
be found which replaces the search for a solution (this has recently
occurred with the ``n queens'' problem).

Now we see that our caching process is the next step in this progression.
It replaces a calculation (e.g., traversing a particular access path to
gather all Ancestors of a node) with a lookup, a single access of a datum
in memory.
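The Ancestors example can be made concrete in a few lines of illustrative Python (the dictionary representation of Parents links is our own):

    _ancestors_cache = {}

    def ancestors(node, parents):
        """Gather all Ancestors of a node: a path traversal the first
        time it is asked, a single access of a datum ever after."""
        if node in _ancestors_cache:
            return _ancestors_cache[node]        # ACCESS: one lookup
        result = set()
        for p in parents.get(node, []):          # CALCULATE: walk the path
            result.add(p)
            result |= ancestors(p, parents)
        _ancestors_cache[node] = result
        return result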



[[here is that fig]]

ACCESS ==> CALCULATE ==> DEDUCE ==> INDUCE BY OPEN-ENDED SEARCH

     \     /       \     /    \     /

      cache        thm, alg  heur rule

[[note: saving multiple
instantiations of the same rule at different levels of abstraction is
NOT worth mentioning, I think, since it is a STATIC (e.g., planning) tactic;
I may be wrong, and perhaps we should at least mention this v. briefly anyway]]

[[Should we mention explicitly the loss that may occur if the task involves
reasoning one's way through delicate inference chains efficiently, or
a lot of synthetic reasoning or rule interaction?]]

\NSECP{EXPECTATION-SIMPLIFIED PROCESSING: INTELLIGENT FOCUS OF ATTENTION}

	Central notion:  reserve your computing for opportunities to realize
		potential for expanding knowledge
		I.e.: be willing and able to add new facts,
			but be eager to add surprising new facts.
		Equivalently: by being guided by expectations (scripts, frames)
			you will find yourself zipping through 90% of what
			comes in from the world, and spending most of your
			processing time on the remaining puzzles and anomalies.
	You may decide how much to expend re-confirming the expected
		You may decide how much time to spend deciding how much time..
	Reductions realizable through expectations:
		Perceptual set:  see what you expect with less effort
		Surprise:  heighten sensitivity to data inconsistent with
			expectations
			Perhaps mention analogy to frog's visual system.
		Predict and prepare
			This includes a whole continuum, growing more
			flexible, more descriptive (less scalar), dynamic:
		  > Store a value, by binding a variable to it, so you can
			later refer to it (retrieve it) by name.
		  > Cache the value of a function-call, so if you ever want
			to find F(x), first you look to see if that value is
			already computed and stored away, else you compute it.
			Note that this is much like hashing, like associative
			memory: the idea that you have a place where F(x)
			should be (the hash of the string PARENTS(FRED), e.g.)
			and you look there first. This is just a generalization
			of the previous ``>" about binding a variable to a value.
		   > After finding F(x), store the information that would
			enable you to recompute it rapidly.  Hopefully, this
			will also help you compute F(y) (either for all y
			or at least for some y's similar to x), and possibly
			will even help you compute G(x), where G is similar
			to F (and, much more rarely, G(y)).
			The idea here is to cache an algorithm, or at least
			constraints on a search, for finding F(x).
			One very special case of this is normal compiling:
			the compiled code for F is just a more efficient way
			of finding values of F. As with our above remarks, one
			often finds that code works interpreted but not
			compiled. This is considered ``wrong" (i.e., a bug in the
			compiler), but is much more in line with our general
			idea of mere assistance, which is sometimes quite
			useful but may sometimes not help you at all.
		   > Synthesize and store away a partial model of the world.
			This may include a (sub)network of frames, which
			together capture enough about the situation that
			in the future its relevance will be recognized, and
			some aspects of this situation will help guide future
			processing on some problem (which is recognized as
			similar).  This is much like analogy, like real
			"learning" (storing away experiences).

	What mechanisms are implicated?
		Caching
		  This should also encompass normal Compiling of subroutines.
		PDMs (as triggers or demons)
		  The analogy: you hear someone yell ``Fire!" in a theater.
		  Also: mention the basis for much of humor: the punchline.
			The intentional misleading initially (wrong LHS)
			An unexpected reaction (wrong RHS)
			This may even provide a start at a taxonomy of
				humor, one which could be operationalized.
	Relevance to learning
		Confirm or disconfirm predictors
		This requires setting up PDMs to fire on dis/confirmation
		Yes, the idea is that when an expectation is not met, we
			also (have some chance of) modifying the culprit
			rule(s) that made that false prediction (and/or
			adding new rule(s) which would make the right one).
			This is very similar to Mycin's task, and the TEIRESIAS
			approach is worth mentioning.
		This ties learning in very closely to scientific theory
			formation (at least those theories of knowledge which
			are activist, revolutionary conventionalist,
			methodological falsificationist, and sophisticated).
			[those are the nodes descending in a genl/spec hier
			of theories of knowledge]
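The dis/confirmation machinery might be set up as below (an illustrative Python sketch; the review step, where blame is assigned and the culprit rule is weakened, patched, or replaced, is only gestured at, much as TEIRESIAS does interactively for Mycin):

    class KnowledgeBase:
        def review(self, rule, event):
            # stand-in: here one would weaken, patch, or replace the rule,
            # or add a new rule that would have made the right prediction
            print('suspect rule:', rule, 'given surprise:', event)

    class Expectation:
        """A demon set up to fire on dis/confirmation of a prediction."""
        def __init__(self, prediction, culprit_rules):
            self.prediction = prediction     # predicate over observed events
            self.culprits = culprit_rules    # the rules that made the prediction

        def observe(self, event, kb):
            if self.prediction(event):
                return                       # confirmed: zip onward cheaply
            for rule in self.culprits:       # surprise! spend effort here, on
                kb.review(rule, event)       #   the rules that predicted falsely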


\NSECP{LEVELS OF ABSTRACTION:  INTELLIGENT KNOWLEDGE STRUCTURING}


In many real-life problem solving situations, the ``goal" is not a single,
precise answer. Rather, there is a cost/benefit tradeoff between the
accuracy of the solution and the amount of resources consumed in producing
it.  As with any such tradeoff, there is typically a point of diminishing
returns.  Computer programs which are designed only to return exact answers
will then be ill-suited to the needs of the various situations in which
they will be called upon.  One area in which this is recognized in AI is
game-playing.  The concept of truncated search is fundamental in all chess
programs.  Yet very little of the same philosophy has permeated to other
kinds of AI programs, let alone programs in other spheres.


[this last para. needs reworking; my analogy to truncated search may confuse
more than illuminate.]

As we discussed in Section 2.1, one of our fundamental assumptions is
that the environment is continuous. This is the justification for believing
a truncated search is better than random selection of a legal move. We assume
-- or must work to assure -- that our abstractions are approximations to the
next level of detail, which in turn approximates the next more detailed level,
etc. (perhaps all the way down to the ``real world", to the true leaves of the search
tree).  An apparent exception is the situation of a rapidly changing
environment. But if the world keeps changing, then that very character of it
remains more or less constant, and the program might hope to tailor itself
into a form that was especially good at rapidly adapting, even if it were
quite slow for any given fixed environment.  In fact, such an adaptable
program would share much in common with what we've termed ``expressive"
programs.\foo{ An instance of this of particular interest to us is biological
evolution: The physical world, and ecosphere, have always been rapidly
changing. It makes sense, then, to expect that advanced organisms would
have evolved good methods for quick adaptation; e.g., their progeny would
vary because of some expressive ``plausible mutation" mechanism, rather than by
some efficient streamlined one, or (even less likely) by the purely random
mutation mechanism which underlies Darwinism and neodarwinism.
``Mutation" here encompasses low-level sequence mutations, high-level
recombination, gene duplication, sexual combination, etc.
One should not attempt to take on an entire established paradigm in a
footnote, so further discussion of evolution by plausible mutation will be
deferred to a later paper.}

Let's look at a couple of examples illustrating the omnipresent need to
approximate reality -- a need which is well met by maintaining multiple
representations of the world at various levels of abstraction;
i.e., an entry in the knowledge base has all its generalizations also
stored explicitly, redundantly.

A train accident occurs, and over two hundred people are injured; the
only large nearby hospital is called, and asked if they can handle the load.
The administrator queries his shiny new simulator, and it begins to chug
away.  It looks over the state of each current patient, each physician,
each hospital resource.  It begins simulating the arrival of the wounded,
their movement and temporary storage in hallways, etc.  After quite a while,
it shows that the hospital will be saturated and unable to care for any more
incoming patients.  The maximum they can accept is 157 wounded passengers.
If ``quite a while" is several hours (which it very likely will be), this is
simply inappropriate. There is a pressing need for a rough answer quickly;
a {\it human} hospital administrator would answer right away, or after a
very brief survey of a few parameters, and {\it so should
an intelligent program}.

A colonel monitoring our defense systems detects a fleet of enemy aircraft
attacking in an unexpected pattern. Can our bombers and missiles meet this
new attack mode?  The colonel will know precisely how much time is available
to answer that question, and it will be a matter of seconds or minutes, not
of hours or weeks.  A detailed simulation program will be of no use to him.
Reasoning will go on at a high level of abstraction, and an approximate answer
will be forthcoming (either from a program capable of such inference, or
-- in lieu of that -- in the colonel's head).


[what are some other possible examples to use here?  We should pick say two,
and use the same ones throughout the paper, whenever possible.]

In addition to aiding us in quickly approximating reality, multiple levels
of abstraction (MLA) are useful for
{\it analogizing}: they can
make that process easier, faster, and occasionally even
more valid. Analogies can be recognized when two distinct concepts turn into
the very same abstraction at some level. Conversely, if two apparently related
concepts (such as two homonyms) do not share related structure at higher
levels of abstraction, that suggests that an analogy between them would be
superficial and lacking in power.  Even more strongly, consider the situation
where a general model of page usage exists for some operating system, and
at the next lower level are 
four mutually inconsistent, specialized, detailed models. It may
be that no one of the detailed models can provide results as good, on
the average, as the general model, which is thus more valid.
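The recognition step is easy to sketch (illustrative Python; the generalization links are assumed to be stored explicitly, as proposed above):

    def abstractions(concept, generalizations):
        """The concept together with all its stored generalizations."""
        out, frontier = {concept}, [concept]
        while frontier:
            c = frontier.pop()
            for g in generalizations.get(c, []):
                if g not in out:
                    out.add(g)
                    frontier.append(g)
        return out

    def analogy_candidates(a, b, generalizations):
        """Nonempty when two distinct concepts turn into the very same
        abstraction at some level; empty when any analogy between them
        would likely be superficial and lacking in power."""
        return abstractions(a, generalizations) & abstractions(b, generalizations)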


In several ways, we have seen how MLA can be an instrument of efficiency.
But this is very strange: we earlier said that reasoning at multiple levels of
abstraction was a technique for heightening expressiveness, not efficiency.
We have spent most of this paper discussing dynamic methods for approximating
reality (e.g., the cached value for Generalizations of Primes may suffice if
a rough answer is needed quickly).  Although the technique of reasoning at
multiple levels of abstraction is much more ``static'', its usage in AI programs
to date has been primarily for expressiveness, not for efficiency. We wish to
clarify how it can serve both apparently conflicting goals.

Is this such a powerful ability that it helps improve both simultaneously,
or is something else going on?  The answer lies, significantly, in the
nature of the task environment.  

Multiple levels of abstraction cost a certain
amount to implement, both initially (man-hours of programming) and as the
program runs (cpu time spent maintaining the redundancy, choosing which level
to work on next, etc.) For problems of all sizes, this feature aids in the
expression of new problems to the program, since they may be input at whatever
level(s) seem appropriate to the user. Thus multiple levels of abstraction
will always add to expressiveness, and will either detract from or add to
efficiency depending upon how small or large the task is. And as we showed in
a recent paragraph, MLA is very useful when the program's resources are
curtailed or, even better, varied dramatically in magnitude from run to run
(a program which was {\it always} limited to a 15-second execution time would
be designed specifically for that task, but a program which might get anywhere
from 5 seconds to 5 hours for the same simulation question could better
benefit from multiple levels of abstraction.)
The graph below summarizes these two results: it pictures MLA
contributing more to efficiency
as the quotient (task size)/(resource allotment) increases.

[[space for graph]]

\yskip 3in






There is a more general result here: the methods which humans find useful for
easy expression of problems may be useful to programs attacking very large
problems.  For instance, consider the decision about whether to do inference
by a system of pattern-invoked experts, or rather to preprocess the triggering
conditions into a big discrimination network. The former approach is more
expressive, the latter more efficient -- at least for small problems. For
very large tasks, the need to remain flexible grows very large. It is then
more economical to retain pattern-directed control (the compilation of the
patterns is so complex that the whole procedure would have to be redone
frequently during the course of execution, if we opted for that alternative
instead).  A case in point is the HARPY speech understanding program [ref]:
whenever the grammar changes -- at any level -- the entire speech recognition
network must be recompiled. HEARSAY II, while slower in performance, has
sufficient flexibility (from several independent pattern-directed knowledge
sources) not to suffer from such recompilation delays.
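The tradeoff can be caricatured in a few lines (illustrative Python; each rule here pre-declares the keys under which a compiler would index it, sidestepping the genuinely hard part of building a discrimination network):

    def run_pattern_directed(rules, events):
        """Flexible: every trigger is matched afresh on every event,
        so a rule added or changed mid-run takes effect immediately."""
        for e in events:
            for keys, trigger, action in rules:
                if trigger(e):
                    action(e)

    def compile_rules(rules):
        """Efficient: preprocess triggers into an index consulted by key.
        But the index must be rebuilt whenever any rule changes --
        the HARPY-style recompilation cost noted above."""
        net = {}
        for keys, trigger, action in rules:
            for k in keys:
                net.setdefault(k, []).append((trigger, action))
        return net

    def run_compiled(net, key_of, events):
        for e in events:
            for trigger, action in net.get(key_of(e), []):
                if trigger(e):               # only a few candidates to check
                    action(e)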


[[Is any of the following appropriate for inclusion, either in this section
or anywhere else in the paper?  It could be used to build a full explanation
of MLA, which we simply assume readers grok completely.]]

	Conceptual hierarchies
		Hier. in general: possible organizing relations
			Supervises, commands, controls, calls (as in subroutines)
			Containment: part-of versus specialization-of versus
					application-of versus element-of

		Inheritability: the importance of "or any ancestor"
			The notion that the hier. is a compact way of storing
				a huge set of assertions, is fairly efficient
				vis-a-vis retrieving / testing for them, and
				is much, much more efficient for updating them.
			This discussion should be general to all above
				organizing relns, not just genl/spec hiers.
		Make sure to distinguish our meaning for abstraction
			(redundant, generalized version) from the other
			common meaning (inducing a general case from examples)
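A sketch of ``or any ancestor" retrieval, in illustrative Python (the two dictionaries stand for whatever property lists and organizing relations the system actually uses), shows why the hierarchy is compact to store and cheap to update: a fact lives at one node, yet is retrievable from every specialization below it, and changing it means touching that one node only.

    def lookup(concept, prop, properties, generalizations):
        """Value of prop on concept or on any ancestor.

        properties:      concept -> {property name: value}
        generalizations: concept -> immediate generalizations"""
        frontier, seen = [concept], set()
        while frontier:
            c = frontier.pop(0)          # breadth-first: nearest ancestor wins
            if c in seen:
                continue
            seen.add(c)
            if prop in properties.get(c, {}):
                return properties[c][prop]
            frontier.extend(generalizations.get(c, []))
        return None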

	Heuristic Hierarchies
		Each heuristic H has a domain of relevance
		More precisely, H's relevance is a function of the task posed,
			and is high for some (its domain of relevance),
			low for many more, and perhaps negative for others.
		Heuristics can be related to each other by genl/specialization
			[example: "If P(x) is often false, weaken P" -->...
				Also an example involving simulation.]
		Heuristics can be object-centered (tacked onto relevant
			concepts) and can therefore inherit THEIR relationship
			[an ex. of that..  This is "Inheritability" in AM
			sense.]


	Interpretation and planning at levels of abstraction
		Eg., rules of bomber simulation at difft levels
		(this example can be one which is used earlier to suggest
		 caching for simplification)
		[this ties in with the last point about heur. hier.]
		Part of this para. might be this example:

``In the short run, what would happen if we didn't take any defensive
actions in a full-scale enemy attack situation?"  A strategist can
answer this instantly, to the effect that the enemy missiles and bombs
would then reach their targets, destruction would be nearly total, our
capacity for retaliation would be severely reduced, and so on. He can
say this without factoring in the individual changes in aileron
attitude throughout the flight of each bomber.
He has a high-level ``script" which specifies defensive action against an
attack, with a small note of explanation (e.g., ``else destruction"), and
that's as detailed as he needs to get to answer the question.
Now consider a question like ``What if we had a fleet of 200 B1 bombers
in addition to our current forces?" The strategist may again be capable
of giving a quick estimate of the effect, but it's possible that a more
detailed analysis would reveal some cusp, some nonlinearity not apparent
to him through simple extrapolation, through the application of general,
high-level rules of thumb.  One strategy might be to employ his old, general,
high-level scenario as a plan, as a set of general expectations
which guide the detailed simulator, which predict which aspects of the
world will benefit from much more detailed simulation (e.g., encounters where
our bombers hitherto failed to reach their targets), and which areas will
probably be unaffected by the change in the world (e.g., attacks on our
coast by enemy submarines, attacks which succeeded before).
[note that we can easily extend this discussion a bit for caching: keep
around the old simulation, at various levels, and only re-run those parts
which might be affected by the change in initial situation, and even then
only rerun it at the highest level possible.  This latter decision is probably
always made locally: run until the results stop changing, then you know
you've expanded down one level too far, so you can back up and stop.]
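That suggestion might run as follows (illustrative Python: affected_parts and simulate_at are assumed hooks into the simulator, one naming the parts of the cached run a change could touch, the other re-running one part at a given level of detail):

    def resimulate(change, old_results, affected_parts, simulate_at, max_depth):
        """Re-run only the parts a change might affect, at the highest
        level possible, descending only while the answer keeps changing."""
        results = dict(old_results)          # keep the old simulation around
        for part in affected_parts(change):
            level = 0
            answer = simulate_at(part, level)
            while level < max_depth:
                deeper = simulate_at(part, level + 1)
                if deeper == answer:         # stopped changing: back up and stop
                    break
                answer, level = deeper, level + 1
            results[part] = answer
        return results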


	Related research
		Include mention of psych. studies showing that people may
			reason at the word level, rather than more primitive
			level, because the high-level word manipulation
			process is more efficient than the manipulation of
			large, detailed structures (such as huge sem. nets).


\NSECP{COGNITIVE ECONOMY REVISITED}

[let's leave this section until last, till the others are written out]

	Sample problem:  using a world model (simulator) to answer questions
		(e.g., what'd happen if 100 bombers went in there?)
		Representation of this knowledge as PDMs at different levels
			of abstraction
		Ability to generalize and cache results at one level at the
			next higher level,
				e.g. either as numerical tables, statistical
				distributions, or symbolic expressions
		Ability to answer some questions appropriately for the requestor
			at a high level of abstraction

	KB Design
		One good reason to use inheritance is to speed knowledge
			implementation, not computing performance
		Using the system should result in its speedup
		Storage should be cheap
	Machine architecture
		PDI should be cheap
		PDMs should be scheduled with variable resources and
			should be able to spend effort accordingly
		How could propagation of changes be made efficient?



\NNSECP{Acknowledgements}

We wish to thank...

\vfill\end